8 - Subclustering

Author

CDN team

Published

December 2, 2024

Introduction

In this vignette we will examine methods for increasing resolution on cell subtypes and cell states. We will compare two methods: increasing resolution and other parameters to find more clusters, and subclustering. Subclustering is the process of clustering, subsetting to one cluster, then running the clustering pipeline again. In high-dimensional datasets, especially ones with lots of technical or biological noise, focusing on specific celltypes individually improves detection of subtype and state patterns. Highly variable gene selection and latent-space calculation are both affected by noise and outliers. Subclustering can also improve computational efficiency - finding small clusters can be expensive if working with the full dataset.

However, it’s important to keep in mind that iterative subclustering can lead to “overfitting” the data. This means we might identify noise as clusters, and we will have to contend more with the “curse of dimensionality” in downstream analysis. We should always validate our clusters according to expression of marker genes, use technical replicates, bootstrapping methods, or check their existence in external datasets.

Vocabulary

Subclustering

The process of dividing a previously identified cluster into smaller, more detailed clusters (subclusters). Subclustering is used to uncover finer, often subtle distinctions within a dataset that might not be visible in an initial analysis.

Overfitting

Overfitting is when an analyst describes random noise in the data rather than underlying general relationships. Overfit models perform well on their dataset, but very poorly on other data.

Curse of Dimensionality

The term data analysts use to describe phenomena that appear when analyzing data in high-dimensional spaces. As dimensionality increases, the data can become sparse. Sparsity is problematic for statistical significance testing. Additionally, by increasing dimensions, we increase the number of false positives when using p-value thresholds.

Parameter scan

AKA parameter sweep, this is the process of systematically varying parameters in an algorithm to analyze the effects of their changes on the outcome. Parameter scans are widely used in computationaly biology to identify optimal parameters or test the stability of our models.

Key Takeaways

  • Recomputing highly variable genes at each subclustering step resets the biological universe we are looking at to the capture the “new” sources of variability.

  • Iterative subclustering is essential to uncover fine grained populations

  • In addition to finding fine-grained populations, subclustering can help create better divisions between the different “species” of celltypes and subtypes.

Libraries

### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("DT", quietly = TRUE))
    install.packages("DT")    

if (!requireNamespace("scales", quietly = TRUE))
    install.packages("scales") 

if (!requireNamespace("tictoc", quietly = TRUE))
    install.packages("tictoc") 

if (!requireNamespace("ggalluvial", quietly = TRUE))
    install.packages("ggalluvial") 

### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(devtools)
library(colorBlindness)
library(DT)
library(scales)
library(RColorBrewer)
library(scales)
library(tictoc)
library(ggalluvial)

set.seed(687)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from the cellxgene portal.

se <- readRDS("../data/se_lvl1.rds")

Subset T cell compartment for downstream analysis

tse <- se[, se$annotation_lvl1 == "T cells"]

Color palette

# Set color palette for cell types
pal <- paletteMartin
names(pal) <- sort(unique(se$Celltype))

donor_pal <- c(
    "#66C2A4", "#41AE76", "#238B45", "#006D2C",
    "#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
    "#FFEDA0", "#FED976", "#FEB24C", "#f79736", "#FD8D3C",
    "#FC4E2A", "#E31A1C", "#BD0026", "#ad393b", "#800000", "#800050")

names(donor_pal) <- c(
    "Normal 1", "Normal 2", "Normal 3", "Normal 4",
    "Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
    "nCoV 1", "nCoV 2", "nCoV 3", "nCoV 4", "nCoV 5",
    "nCoV 6", "nCoV 7", "nCoV 8", "nCoV 9", "nCoV 10", "nCoV 11"  
)

To get up to speed with the previous worksheets, process the data in the same way.

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(
        method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.))

Let’s extract the top 3000 HVGs from the whole data across all cell types.

hvg_full <- VariableFeatures(se)

Analysis

Finding rare celltypes

For many of us, our first idea for finding rare celltypes is to modulate the parameters of what we have already to find the right celltypes. As we saw in the previous clustering notebook, we can find NK and T cell clusters this way, but it still seems like there’s some heterogeneity in the clusters we’ve found. For illustration, I’ve run a parameter scan on the following variables:

  • FindVariableFeatures nfeatures
  • RunPCA npcs,
  • FindNeighbors k.param,
  • FindClusters resolution
parameter_df <- expand.grid(
  nf = c(2000, 3000, 5000),
  pc = c(20, 30, 50),
  k = c(10, 30, 50),
  res = c(0.5, 0.8, 1.2)
)

seurat_parameter_scan <- function(srobj, nf, pc, k, res) {
  srobj <- srobj %>%
    NormalizeData() %>%
    FindVariableFeatures(selection.method = "vst", nfeatures = nf) %>%
    ScaleData(features = head(VariableFeatures(object = .), nf)) %>%
    RunPCA(npcs = pc, verbose = FALSE) %>%
    FindNeighbors(dims = 1:pc, k.param = k) %>%
    FindClusters(resolution = res)

  # Extract Idents and name the output vector
  idents <- Idents(srobj)
  names(idents) <- paste("nf", nf, "pc", pc, "k", k, "res", res, sep="_")
  
  return(idents)
}

# Apply the function to each row of parameter_df and combine results into a single data frame
paramscan <- lapply(seq_len(nrow(parameter_df)), function(i) {
  params <- parameter_df[i, ]
  idents_vector <- seurat_parameter_scan(se, params$nf, params$pc, params$k, params$res)
  return(idents_vector)
})

# name cluster columns after parameters used to obtain them
names(paramscan) <- parameter_df %>%
  mutate(params = pmap_chr(., function(...) {
    cols <- colnames(parameter_df)
    values <- list(...)
    paste0(cols, values, collapse = "_")
  })) %>%
  pull(params)

saveRDS(paramscan, '../data/covid_flu_srobj_clusters_paramscan.rds')
paramscan <- readRDS('../data/covid_flu_srobj_clusters_paramscan.rds')

paramscan <- bind_cols(paramscan)

row.names(paramscan) <- colnames(se)

paramscan_long <- paramscan %>%
  rownames_to_column(var = "cell_id") %>%
  pivot_longer(
    cols = -cell_id,
    names_to = "parameter",
    values_to = "cluster"
  ) %>%
  mutate(
    nf = as.numeric(gsub(".*nf(\\d+)_.*", "\\1", parameter)),
    pc = as.numeric(gsub(".*pc(\\d+)_.*", "\\1", parameter)),
    k = as.numeric(gsub(".*k(\\d+)_.*", "\\1", parameter)),
    res = as.numeric(gsub(".*res(\\d+\\.\\d+).*", "\\1", parameter))
  ) %>% 
  group_by(parameter) %>% 
  mutate(
    n_clusts = max(as.numeric(cluster))
  ) %>%
  select(-parameter)

ggplot(paramscan_long %>% 
        select(-cell_id, -cluster) %>% 
        distinct, aes(x = factor(pc), 
        y = n_clusts, 
        fill = factor(k))) +
  geom_bar(stat='identity') +
  facet_grid(paste('k=',k) + paste('nf =',nf) ~ paste('resolution =',res)) +
  scale_y_continuous(labels = scales::label_number()) +
  scale_fill_manual(values = unname(donor_pal[c(1,6,11)])) +
  labs(
      title = "Number of clusters Across Parameters",
      x = "Clustering resolution + nPCs",
      y = "Number of clusters",
      fill = "k param") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Take a close look at the results. When we modulate nfeatures or nPCs, we don’t directly see changes in the number of celltypes found. But, like we saw in the previous sessions, modulating the k.param and the clustering resolution have outsized effects on the number of clusters found.

But if we’re looking for a rare celltype, we must include more information, right? It makes most sense to increase both the nfeatures, nPCs, AND clustering resolution to find that celltype - because we need to make sure we are including the genes that define the celltype, and making small enough clusters to be able to find it.

Comparing subclustering - using full dataset HVGs within only the subset

To do this we will focus on the T cell compartment.

First let’s process our T cell subset object with the HVG obtained from the whole dataset - hvg_full:

tse_full <- tse %>%
    NormalizeData() %>%
    ScaleData(
        verbose = FALSE,
        features = hvg_full) %>% 
    RunPCA(
        features = hvg_full,
        npcs = 20,
        verbose = FALSE) %>%
    FindNeighbors() %>%
    FindClusters(resolution = c(0.05, 0.1, 0.15, 0.2))
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 811889

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9683
Number of communities: 4
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 811889

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9544
Number of communities: 5
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 811889

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9412
Number of communities: 7
Elapsed time: 2 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 811889

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9322
Number of communities: 10
Elapsed time: 3 seconds
# Visualize these clusters
DimPlot(
    tse_full,
    reduction = 'umap',
    group.by = c("RNA_snn_res.0.05", "RNA_snn_res.0.1",
                 "RNA_snn_res.0.15", "RNA_snn_res.0.2"))

dim_full <- DimPlot(
    tse_full,
    reduction = 'umap',
    group.by = "RNA_snn_res.0.15")

Now let’s recompute the HVG for the T cell subset to capture the variability within that subset:

tse_sub <- tse %>%
    FindVariableFeatures(
        method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>% 
    RunPCA(
        npcs = 20,
        verbose = FALSE
    ) %>%
    FindNeighbors() %>%
    FindClusters(resolution = c(0.05, 0.1, 0.15, 0.2)) %>%
    RunUMAP(dims = 1:20)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 814437

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9660
Number of communities: 4
Elapsed time: 4 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 814437

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9502
Number of communities: 5
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 814437

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9381
Number of communities: 8
Elapsed time: 3 seconds
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 26261
Number of edges: 814437

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9287
Number of communities: 10
Elapsed time: 3 seconds
# Visualize these clusters
DimPlot(
    tse_sub,
    reduction = 'umap',
    group.by = c("RNA_snn_res.0.05", "RNA_snn_res.0.1",
                 "RNA_snn_res.0.15", "RNA_snn_res.0.2"))

dim_sub <- DimPlot(
    tse_sub,
    reduction = 'umap',
    group.by = "RNA_snn_res.0.15")

Right off the bat, how do these UMAPs compare to each other

dim_full + dim_sub

What you’re probably wondering first is, hey would I have found these clusters if I had just increased the number of clusters I used?

data_alluvial <- data.frame(
    bc = colnames(tse_full),
    full_hvg = tse_full$RNA_snn_res.0.15,
    sub_hvg = tse_sub[, colnames(tse_full)]$RNA_snn_res.0.15) %>% 
    dplyr::count(full_hvg, sub_hvg)

ggplot(data = data_alluvial,
       aes(axis1 = full_hvg, axis2 = sub_hvg, y = n)) +
    geom_alluvium(aes(fill = full_hvg), width = 0.1) +
    geom_stratum(width = 0.1) +
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    theme_minimal() +
    labs(title = "Alluvial plot of Clustering using full data HVG or  subcluster HVG",
         x = "Cluster assignment",
         y = "N")

Both clustering resolutions seem to be very similar but using specific HVG we are able to detect one more cluster at the same resolution and, in theory, the clusters should be more specific.

If we check for example the overlap of highly variable features between the two methods, we’ll find there are very few overlapping

table(VariableFeatures(tse_sub) %in% hvg_full)

FALSE  TRUE 
 1064  1936 
head(setdiff(VariableFeatures(tse_sub),  hvg_full), n = 50)
 [1] "BTK"           "PTMA"          "UQCRH"         "CLIC4"         "MBOAT2"        "COL9A3"        "MSRB3"         "MAP7"          "JUP"           "TIMM13"        "NDUFS6"        "SLC25A3"       "SLC45A3"       "MTDH"          "MRPL52"        "FKBP1A"        "PSMB3"         "HNRNPD"        "ZNF706"        "ATP5MF"        "YBX3"          "XRCC5"         "HDGF"          "ACTR3"         "SAPCD2"        "ATP5PF"        "COX7B"         "ANP32A"        "ATP5F1C"       "CHCHD2"        "SRSF3"         "VCP"           "SSR1"          "LCA5"          "SMS"           "ERH"           "CFL1"          "RPLP0"         "UQCRC1"        "CALM3"         "TIMM8B"        "NDUFAB1"       "PSMA7"         "COX8A"         "NCOA3"         "RP11-281P23.1" "EIF5A"         "UQCRQ"         "GAS2L3"        "PSMB8"        
intersect(VariableFeatures(tse_sub),  hvg_full)
   [1] "JCHAIN"                 "IGKC"                   "MZB1"                   "IGHG4"                  "HBB"                    "IGHG1"                  "IGHG3"                  "HBA2"                   "IGLC2"                  "IGHA1"                  "H4C3"                   "HBA1"                   "ITM2C"                  "UBE2C"                  "IGLC3"                  "TUBA1B"                 "IGHA2"                  "TNFRSF17"               "HSP90B1"                "IGHG2"                  "CD79A"                  "STMN1"                  "DERL3"                  "RRM2"                   "TYMS"                   "TOP2A"                  "MKI67"                  "IGHM"                   "PCLAF"                  "PPBP"                   "TUBB"                   "POU2AF1"                "SEC11C"                 "HSPA1A"                 "PTGDS"                  "CCL3"                   "MYBL2"                  "LYZ"                    "PCNA"                   "FABP5"                  "BIRC5"                  "S100A8"                 "IGLL5"                  "S100A9"                 "CCL4L2"                 "NUSAP1"                 "HMGB2"                 
  [48] "IFI27"                  "CENPF"                  "CCNB1"                  "FAM30A"                 "H1-5"                   "IGLV3-1"                "CCNB2"                  "CCL20"                  "PLK1"                   "CDC20"                  "IGLV6-57"               "H2AZ1"                  "IGHV6-1"                "IGKV4-1"                "SSR3"                   "SOX4"                   "TK1"                    "DUT"                    "PRDX4"                  "CCL4"                   "HSPD1"                  "IFIT2"                  "IGKV3-15"               "PF4"                    "ASPM"                   "UCHL1"                  "MYDGF"                  "CKS2"                   "EAF2"                   "IGKV3-20"               "S100B"                  "TNFRSF13B"              "IGHV4-39"               "IGKV3-11"               "CDK1"                   "CDKN3"                  "MCM7"                   "IGHV3-7"                "IGHGP"                  "ZWINT"                  "AQP3"                   "TUBA1C"                 "HMGA1"                  "IGLV3-19"               "PRDX1"                  "GAPDH"                  "COTL1"                 
  [95] "PPIB"                   "MT1G"                   "UBE2J1"                 "COBLL1"                 "IGHV3-72"               "ALAS2"                  "CKS1B"                  "CENPM"                  "CDT1"                   "HSPA6"                  "CCNA2"                  "TPX2"                   "IGKV1-39"               "CST3"                   "BASP1"                  "GINS2"                  "CLSPN"                  "CA1"                    "SMIM24"                 "SMC4"                   "RNASE6"                 "S100A12"                "SDC1"                   "HMMR"                   "CCND2"                  "LINC01781"              "PDIA4"                  "PTTG1"                  "FKBP11"                 "TUBB4B"                 "CD38"                   "SPINK2"                 "LILRB4"                 "MEF2C"                  "CSF2"                   "CYTL1"                  "RP11-16E12.2"           "PDIA6"                  "MCM5"                   "TNF"                    "KPNA2"                  "MCM4"                   "CENPW"                  "DLGAP5"                 "SUB1"                   "RP11-1070N10.3"         "IGLV2-23"              
 [142] "NRGN"                   "PHGDH"                  "AREG"                   "CAV1"                   "XCL1"                   "HSPA1B"                 "PTP4A3"                 "CD27"                   "SDF2L1"                 "FCER1A"                 "HMGN2"                  "H1-3"                   "GNG4"                   "NT5DC2"                 "UBE2S"                  "NASP"                   "VCAN"                   "AURKB"                  "NME1"                   "IGLV4-69"               "RANBP1"                 "IGHV1-46"               "CENPU"                  "MCM3"                   "JPT1"                   "LMAN1"                  "DHFR"                   "RP11-1143G9.4"          "IGKV1-5"                "SRM"                    "SMC2"                   "IGLVI-70"               "ACTB"                   "AVP"                    "GAS6"                   "CPA3"                   "FOS"                    "MAD2L1"                 "ACTG1"                  "KLHL14"                 "HLA-DRA"                "IGLV1-51"               "SPC25"                  "IGLV1-40"               "CDCA5"                  "HES4"                   "H1-2"                  
 [189] "GMNN"                   "IGHV3-64D"              "CMTM5"                  "TRBV14"                 "ERG"                    "CENPA"                  "SHCBP1"                 "CPNE5"                  "IGHV1-18"               "H2AC16"                 "RP11-1I2.1"             "PRDX3"                  "HMGB1"                  "KIFC1"                  "FCRL5"                  "H2AC12"                 "CDCA8"                  "IGHV3-53"               "PDLIM1"                 "SLC25A5"                "SLC2A5"                 "HMGB3"                  "SKA3"                   "TCF4"                   "GTSE1"                  "ENO1"                   "PAICS"                  "BANK1"                  "H2AX"                   "CDO1"                   "GGH"                    "RAN"                    "CTSH"                   "KIF11"                  "TXN"                    "MYCN"                   "RPL22L1"                "IGHV3-30"               "H2AC20"                 "RPA3"                   "CENPE"                  "CDCA7"                  "SPCS3"                  "PKMYT1"                 "AHSP"                   "CDC6"                   "CD34"                  
 [236] "RGS18"                  "GPRC5D"                 "EIF4A3"                 "GP9"                    "CDCA2"                  "KIF2C"                  "IFIT3"                  "HJURP"                  "GNLY"                   "DUSP4"                  "MARCKS"                 "HELLS"                  "NCAPG"                  "IGHV5-51"               "AIF1"                   "CNRIP1"                 "H3C8"                   "EGR1"                   "KIF23"                  "CDK6-AS1"               "MANF"                   "PHF19"                  "UBE2T"                  "HSP90AA1"               "HBG2"                   "EGFL7"                  "CCT5"                   "FBXO5"                  "BEX1"                   "WARS1"                  "SH2B2"                  "PPA1"                   "CENPN"                  "FCGR2B"                 "HBD"                    "CCR10"                  "RHOB"                   "IGF1"                   "RAD51AP1"               "GATA2"                  "CDCA3"                  "ATP5F1B"                "IGHV1-24"               "LDHA"                   "CA2"                    "RNASE2"                 "XBP1"                  
 [283] "SNCA"                   "TPI1"                   "IER3"                   "CD9"                    "CDC45"                  "IGLV4-60"               "CACYBP"                 "IGFL2"                  "GZMK"                   "PMCH"                   "HLA-DRB5"               "BUB1"                   "IGHV3-66"               "IGHV3-48"               "DNAJC9"                 "PGAM1"                  "GNG7"                   "TMPO"                   "ATP5MC1"                "ASF1B"                  "IGKV2-30"               "FKBP2"                  "TXNDC5"                 "RRM1"                   "PLD4"                   "IGHV2-5"                "CALR"                   "IFNG"                   "EREG"                   "MS4A1"                  "TIMD4"                  "MCM2"                   "ATP5MC3"                "CLPTM1L"                "ANXA2"                  "ANXA5"                  "TROAP"                  "IGHV1-69D"              "PRSS57"                 "TRBV7-2"                "IGHD"                   "NUF2"                   "CAVIN1"                 "IGHV3-23"               "RGS1"                   "MRPL51"                 "PNOC"                  
 [330] "H1-0"                   "PKM"                    "UHRF1"                  "IGKV1-17"               "TUBB1"                  "CTSZ"                   "CD79B"                  "NPR3"                   "LMO2"                   "PTCRA"                  "C1QBP"                  "FCRLA"                  "ABCB9"                  "PPP1R14A"               "YWHAE"                  "KLRC1"                  "CPVL"                   "TXNDC17"                "AHCY"                   "HNRNPAB"                "PLPP5"                  "HLA-DQA1"               "TFPI"                   "TRPM4"                  "ODC1"                   "CEP55"                  "LTB"                    "PBK"                    "MT1E"                   "TRIB1"                  "E2F1"                   "HSPE1"                  "CAVIN2"                 "RPN2"                   "TKTL1"                  "MYL9"                   "CKAP2L"                 "HBM"                    "FAM111B"                "SNRNP25"                "DCTPP1"                 "MT2A"                   "CRELD2"                 "DNAAF1"                 "ENTPD1"                 "HES1"                   "PKHD1L1"               
 [377] "PPP1R14B"               "ANP32E"                 "DTYMK"                  "KLF1"                   "RRBP1"                  "PROM1"                  "LINC02573"              "NDC80"                  "ID1"                    "ENHO"                   "NPW"                    "BLNK"                   "SPC24"                  "CHEK1"                  "MAFB"                   "RASSF6"                 "CLEC10A"                "GNG11"                  "BCL11A"                 "C1QTNF4"                "MYOM2"                  "NCAPH"                  "KCNN3"                  "PSMB2"                  "FCN1"                   "TM4SF1"                 "CD1C"                   "BHLHE41"                "LMNB1"                  "AURKA"                  "HSP90AB1"               "TACC3"                  "GATA1"                  "PSMA4"                  "RGCC"                   "SLC4A1"                 "SSR4"                   "ACRBP"                  "EBNA1BP2"               "CCL3L1"                 "LINC01480"              "ORC6"                   "KNL1"                   "MAN1A1"                 "APOBEC3B"               "CD180"                  "KDELR2"                
 [424] "P2RX5"                  "P4HB"                   "E2F2"                   "SPIB"                   "ALYREF"                 "SEC61B"                 "RP11-354E11.2"          "FEN1"                   "H3C7"                   "SGO2"                   "IGHV4-34"               "JUN"                    "XCL2"                   "IMPDH2"                 "PPIA"                   "RNU12-ENSG00000270022"  "ESCO2"                  "SPI1"                   "NXF3"                   "ERLEC1"                 "DYNC2I2"                "MIR23AHG"               "IGLL1"                  "DDX39A"                 "ANLN"                   "LRRC59"                 "FOXM1"                  "CDKN1A"                 "TCL1A"                  "CXXC5"                  "HSPA5"                  "LAPTM4B"                "DEK"                    "CHCHD10"                "CH25H"                  "CKAP4"                  "HLA-DQB1"               "C2orf88"                "MS4A6A"                 "SRPRB"                  "IFIT1"                  "LGALS1"                 "SSRP1"                  "ECT2"                   "MNDA"                   "SLBP"                   "DPEP1"                 
 [471] "TMEM106C"               "IGKV3D-15"              "IGHV4-4"                "VDAC1"                  "SEC61G"                 "CXCR3"                  "HM13"                   "FCRL2"                  "OSTC"                   "CD74"                   "ANP32B"                 "DTL"                    "SAE1"                   "SEMA4A"                 "IGLV3-9"                "KIF14"                  "GBP1"                   "GGCT"                   "ATP1B3"                 "YBX1"                   "SNRPE"                  "PSME2"                  "MIF"                    "ISOC2"                  "BIK"                    "TALDO1"                 "PLS3"                   "MCM6"                   "LSM4"                   "NLRP7"                  "NPM1"                   "KIF22"                  "IDH2"                   "ACTN1"                  "NDUFB3"                 "NANS"                   "PA2G4"                  "NCF1"                   "CORO1A"                 "H1-4"                   "MX1"                    "ZNF385D"                "SGO1"                   "IGLV3-21"               "IL10"                   "APOC1"                  "TXNDC11"               
 [518] "DIAPH3"                 "IGKV1-9"                "MACROH2A1"              "HLA-DRB1"               "MRPL12"                 "KIF15"                  "NUDT1"                  "GMPPB"                  "CD19"                   "TUBG1"                  "LAMP5"                  "VIM"                    "CLEC1B"                 "TNFRSF4"                "IRF4"                   "CAPG"                   "PIM2"                   "FANCI"                  "PYCR1"                  "LRR1"                   "CRYGD"                  "DTNB-AS1"               "SKA1"                   "PMAIP1"                 "IGLC5"                  "DIPK1B"                 "PARM1"                  "MPIG6B"                 "ESAM"                   "ALDH2"                  "CCT8"                   "BUB1B"                  "PSMA5"                  "TMEM40"                 "THBS1"                  "SGK1"                   "EPCAM"                  "COX5A"                  "GIHCG"                  "DEPDC1B"                "LIG1"                   "RGS16"                  "H2AZ2"                  "HP"                     "HSPH1"                  "PARP1"                  "SELL"                  
 [565] "TCP1"                   "CCDC167"                "TRAP1"                  "RHEX"                   "H2AC6"                  "CRHBP"                  "LGALS3"                 "TMEM258"                "C3orf14"                "TRAC"                   "CADM1"                  "MCM10"                  "CBX3"                   "SKA2"                   "AIRE"                   "ZBTB32"                 "MXD3"                   "GZMA"                   "HLA-DQA2"               "H2BC7"                  "ACTG2"                  "SNRPG"                  "SPCS2"                  "IGLV2-11"               "CARHSP1"                "MLEC"                   "H2BC9"                  "PKLR"                   "RTKN2"                  "TOR3A"                  "PFN1"                   "IGHV3-43"               "JSRP1"                  "RETN"                   "H2BC8"                  "DYNLL1"                 "SLC43A3"                "IGKV1D-8"               "RFC4"                   "CHAC1"                  "HBG1"                   "CYBB"                   "DENND5B"                "IFI30"                  "SPTA1"                  "CD8A"                   "MYL4"                  
 [612] "SERPINA1"               "CBX5"                   "IGLC7"                  "CYC1"                   "FCER1G"                 "SHMT2"                  "EMID1"                  "TRAV1-1"                "IGLV3-25"               "IGLV2-14"               "LINC02446"              "PDCD5"                  "CCR7"                   "TFDP1"                  "DEPDC1"                 "IGHV4-31"               "MTHFD1"                 "LST1"                   "TTK"                    "ISG15"                  "PRDX2"                  "HNRNPA2B1"              "IGHV3-21"               "BLK"                    "POLQ"                   "CD8B"                   "LMAN2"                  "STOML2"                 "POPDC3"                 "CXCR6"                  "TMED9"                  "LIMS1"                  "CCT7"                   "PRMT1"                  "TREML1"                 "CD70"                   "PPIF"                   "CDCA4"                  "H2AC14"                 "PRC1"                   "MYC"                    "MGLL"                   "DCPS"                   "CDK4"                   "ASNS"                   "CIP2A"                  "SAC3D1"                
 [659] "FH"                     "SLC38A5"                "YWHAH"                  "IGLV2-8"                "DNMT1"                  "LMNA"                   "LYL1"                   "TRGC1"                  "FOSB"                   "IGLV1-47"               "NEK2"                   "CCDC50"                 "HTR1F"                  "ARID3A"                 "FTL"                    "LTBP1"                  "ATP5F1A"                "KIF4A"                  "ARHGAP11A"              "GADD45A"                "CTNNAL1"                "EIF1AY"                 "GUCY1B1"                "MT-ND6"                 "ATAD2"                  "ACY3"                   "IGKV1D-39"              "IGLV8-61"               "CCT6A"                  "E2F8"                   "EZH2"                   "IGLV5-48"               "CLDN3"                  "RP11-301G19.1"          "PHB"                    "RACGAP1"                "ZNF521"                 "RBBP8"                  "KLRC2"                  "HPGD"                   "TSC22D1"                "RASGRP3"                "IGHV4-28"               "H1-10"                  "ITGA2B"                 "EVA1B"                  "BANF1"                 
 [706] "TKT"                    "MND1"                   "SLIRP"                  "MTFR2"                  "KCNH2"                  "EIF2S2"                 "SPAG5"                  "ARF4"                   "DNAJB11"                "IGFBP7"                 "IFI44L"                 "DDOST"                  "PRR11"                  "TRGC2"                  "LDHB"                   "PDXK"                   "CCT2"                   "BRCA1"                  "CD14"                   "BCL2L12"                "PIF1"                   "PRKAR2B"                "NR4A1"                  "CTLA4"                  "NEXN"                   "IGHV3-33"               "ACOT7"                  "OAS1"                   "CENPH"                  "SPATS2"                 "ANGPT1"                 "USP1"                   "CLU"                    "THBD"                   "ALDH1A1"                "C11orf96"               "TPM4"                   "MYB"                    "CHPF"                   "RBM24"                  "BLVRB"                  "PPM1G"                  "CXCL13"                 "TRIP13"                 "CHAF1A"                 "IGKV2-24"               "FKBP4"                 
 [753] "CSTA"                   "TPD52"                  "IER2"                   "OSTN-AS1"               "CEBPD"                  "SERPINH1"               "LGALS2"                 "PLAAT2"                 "ATAD5"                  "KIF20A"                 "SMC1A"                  "MTCH2"                  "CCNF"                   "TIMM10"                 "TRBV13"                 "SMC3"                   "ICOS"                   "MRPL37"                 "IGHV2-70"               "SNRPF"                  "OTUD1"                  "TMPRSS11E"              "HSPB11"                 "ITGA2"                  "ELL2"                   "TXNDC15"                "SLC25A4"                "SELENOS"                "H4C8"                   "ID3"                    "LAIR2"                  "ATF5"                   "GLDC"                   "PPP1R2C"                "CNN2"                   "TAMALIN"                "HLA-DMA"                "AXL"                    "HDLBP"                  "DUSP5"                  "F2RL3"                  "SHMT1"                  "CCDC34"                 "DUSP6"                  "IFITM3"                 "TSPAN13"                "NPM3"                  
 [800] "E2F7"                   "SERPINB1"               "RPS4Y1"                 "PDGFA"                  "TMEM156"                "IL21-AS1"               "MELK"                   "H1-1"                   "HSPA8"                  "ENSG00000277856.1"      "IGHV3-20"               "CIBAR2"                 "TYMP"                   "MAP1B"                  "GZMB"                   "CCR6"                   "PASK"                   "IL1B"                   "POLR3K"                 "RALGPS2"                "DUSP2"                  "CORO1B"                 "CANX"                   "CDC25B"                 "CCT3"                   "PRKDC"                  "MME"                    "NFKBIA"                 "ADGRG6"                 "UNC13C"                 "CCDC88A"                "PRELID1"                "CLC"                    "SPON2"                  "ANXA3"                  "NCAPD2"                 "POC1A"                  "TIMP3"                  "ORC1"                   "BZW2"                   "ARL6IP1"                "FXYD2"                  "NPIPA2"                 "CPEB4"                  "COCH"                   "CCNE1"                  "YEATS4"                
 [847] "TCF19"                  "AC004540.4"             "WDR76"                  "TGFB1I1"                "FAM178B"                "OIP5"                   "ETV1"                   "CALU"                   "SPDL1"                  "TSPAN12"                "VSIG4"                  "CCR3"                   "PMVK"                   "FOXP3"                  "SEC61A1"                "SUPT16H"                "FKBP3"                  "TLR10"                  "CD59"                   "ELFN1-AS1"              "IL6R"                   "RAB13"                  "PSMG1"                  "MGST1"                  "RAB32"                  "LAP3"                   "ZNF683"                 "MIR4435-2HG"            "NUDT5"                  "G0S2"                   "MRPS34"                 "LSM5"                   "GP1BA"                  "CDCA7L"                 "H2AC11"                 "POLD2"                  "DPPA4"                  "TRIM55"                 "PTPRD"                  "BATF"                   "RP5-1028K7.2"           "NEGR1"                  "PSMD1"                  "PAQR4"                  "FUT7"                   "SNRPB"                  "IGHV3-11"              
 [894] "CISD2"                  "SSPN"                   "VWDE"                   "ST6GALNAC4"             "GRN"                    "TNFRSF18"               "NEFL"                   "HSD17B10"               "GNA15"                  "NFKBID"                 "CES1"                   "PSMA3"                  "CD69"                   "ALG5"                   "RCC1"                   "EIF2S1"                 "B9D1"                   "ZNF215"                 "TRIM28"                 "FAM3C"                  "MT1H"                   "ZG16B"                  "ALDH1L2"                "BRCA2"                  "VCAM1"                  "LY6G6F"                 "KCNMA1"                 "RFC5"                   "VRK1"                   "BARX1"                  "RFC2"                   "CPXM1"                  "PSAT1"                  "NOP56"                  "HADH"                   "MTHFD2"                 "EGR3"                   "PCDH9"                  "TRBV28"                 "SEL1L3"                 "TRAV20"                 "TIMELESS"               "KIF20B"                 "CLEC11A"                "CDK6"                   "ENSG00000276345.1"      "CRIP1"                 
 [941] "COPS3"                  "SESN3"                  "SYK"                    "CTB-50L17.16"           "MRC1"                   "COPB2"                  "TRDC"                   "ATIC"                   "IFI27L1"                "COL6A2"                 "KCNQ1OT1"               "CALD1"                  "MIR193BHG"              "NPTX2"                  "NUCKS1"                 "TP53INP1"               "CENPX"                  "CISD1"                  "VOPP1"                  "RUVBL2"                 "ASS1"                   "TAL1"                   "HIRIP3"                 "LGALSL"                 "GCSH"                   "MPP1"                   "CLECL1"                 "KCTD12"                 "RP11-160E2.6"           "NPDC1"                  "TSHZ2"                  "CLDND1"                 "FPR1"                   "MYLK"                   "CFD"                    "RPPH1-ENSG00000259001"  "ADAM19"                 "MRPL13"                 "CKAP2"                  "INCENP"                 "ITGB3BP"                "LAG3"                   "TIMP2"                  "NCAPG2"                 "TPSB2"                  "LXN"                    "BOLA3"                 
 [988] "HPRT1"                  "MMRN1"                  "GSN"                    "RHAG"                   "TMEM176B"               "MS4A2"                  "GSTM3"                  "FAM43A"                 "NUCB2"                  "MCUR1"                  "CXCL8"                  "VPREB3"                 "ENSG00000275063.1"      "RASD1"                  "YWHAQ"                  "STIL"                   "TNFSF9"                 "NCAPD3"                 "RUVBL1"                 "SLC1A5"                 "UQCC2"                  "SH2D1A"                 "HOXA5"                  "IFI6"                   "HPGDS"                  "VDAC3"                  "FANK1"                  "ATAD3A"                 "SPINT2"                 "LSM2"                   "ACAT2"                  "TEX9"                   "GSTO1"                  "ENAM"                   "RGS13"                  "TRBV23-1"               "TRBV29-1"               "H2BC4"                  "KLRB1"                  "RMI2"                   "EXO1"                   "IGHV7-4-1"              "PXMP2"                  "KEL"                    "TRAV40"                 "F11-ENSG00000088926"    "PRTG"                  
[1035] "SHC4"                   "LINC01623"              "TRAV29DV5"              "SEPHS1"                 "AHSA1"                  "PTMS"                   "NFE2"                   "LGMN"                   "PRPF19"                 "TRBV20-1"               "ACTL6A"                 "RNF130"                 "MED12L"                 "MYL6B"                  "STT3A"                  "B3GNT7"                 "RPS27L"                 "SLC44A1"                "CHI3L2"                 "STAT1"                  "AC104024.1"             "SELENOP"                "TCEAL9"                 "MS4A4A"                 "RP11-1399P15.1"         "KRT86"                  "IL22"                   "AJ006998.2"             "HPSE"                   "MAL"                    "SNRPD1"                 "ISG20"                  "CTSG"                   "CRTAM"                  "CCNA1"                  "BARD1"                  "LINC01485"              "MSH6"                   "H2BC11"                 "NOSIP"                  "STIP1"                  "HAUS1"                  "TOMM40"                 "CRIP2"                  "IGHV1-69"               "IGFBP4"                 "NEFM"                  
[1082] "SMOX"                   "LINC01943"              "ZNF704"                 "TRBV21-1"               "IL2RA"                  "DDX11"                  "SPARC"                  "GPR183"                 "BCAT1"                  "TOX2"                   "MRPL4"                  "RP11-386I14.4"          "LRRC25"                 "TRAV38-2DV8"            "POLR2H"                 "H2BU1"                  "IL17RB"                 "MMP2"                   "AC015969.3"             "HOXC9"                  "KCNK15-AS1"             "HAT1"                   "GYPA"                   "CH17-373J23.1"          "RP1-34B20.21"           "CMSS1"                  "NCF2"                   "TSPYL2"                 "FAM136A"                "SLC27A2"                "MDH1"                   "ATF3"                   "RGS2"                   "MET"                    "OASL"                   "EPRS1"                  "GALNT14"                "CD4"                    "PERP"                   "TMEM14C"                "HPDL"                   "MRTO4"                  "C19orf48"               "OXCT2"                  "CDK2AP2"                "EGR4"                   "GDF10"                 
[1129] "UNC5B-AS1"              "NUP37"                  "IRF8"                   "CEP152"                 "ICAM1"                  "PHACTR1"                "TIPIN"                  "KIR2DL3"                "SOCAR"                  "RAD51C"                 "CDKN2A"                 "S100A11"                "CHN1"                   "GOT2"                   "LY86"                   "KCNK17"                 "CYCS"                   "LINC00582"              "ZNF331"                 "MAGOHB"                 "CENPP"                  "IGLV3-27"               "ST6GAL2"                "RP11-480C16.1"          "HLA-DPA1"               "ZFP36"                  "GNL3"                   "CD36"                   "AKAP12"                 "EEF1E1"                 "IMPA2"                  "SCN3B"                  "CHST2"                  "CCL23"                  "PRIM1"                  "TRAV30"                 "ROR2"                   "TGFBI"                  "TMEM45A"                "AKR1C3"                 "TAP1"                   "PDCD1"                  "TIGIT"                  "MRPL22"                 "TNFRSF9"                "CD86"                   "CRYBA4"                
[1176] "CHST15"                 "H2BC3"                  "PARVB"                  "GLRX5"                  "MT1X"                   "HMOX1"                  "RP11-251M1.1"           "KCNIP4-IT1"             "KIR3DL2"                "CENPS"                  "PSMD14"                 "THOC3"                  "VEGFA"                  "IL2"                    "SOCS3"                  "CCNE2"                  "MRPS6"                  "H4C15"                  "CD320"                  "POLA2"                  "FOXC1"                  "PKIB"                   "FDPS"                   "CAMK1"                  "IGLV1-44"               "CLEC9A"                 "TRAV38-1"               "SPTSSB"                 "DKC1"                   "KIR2DL1"                "FBXO16"                 "CDC25A"                 "NSD2"                   "SNAI1"                  "CD160"                  "MTHFD1L"                "SYT1"                   "PGAP4"                  "CEND1"                  "IL18"                   "KDM6B"                  "OXCT1"                  "RPA1"                   "CFAP43"                 "F13A1"                  "NAP1L1"                 "HMGN1"                 
[1223] "DUSP1"                  "AP1S2"                  "CIT"                    "LMNB2"                  "NR4A2"                  "LIMCH1"                 "RAPGEF4"                "EPB41L4B"               "PCDH17"                 "KIAA0087"               "INHBA"                  "KIAA1549"               "ASCL1"                  "MTUS2-AS1"              "RP11-538D16.3"          "KIF4B"                  "TP73-AS3"               "RP11-469A15.2"          "DEPDC1-AS1"             "RP11-298E9.5"           "RP11-569G13.2"          "ARNTL2-AS1"             "LINC01511"              "LINC02345"              "RP4-555D20.4"           "RP1-136B1.1"            "U91328.22"              "RP11-2H3.6"             "BCL7A"                  "RP11-879F14.2"          "RUBCNL"                 "RBBP7"                  "NEFH"                   "DSCC1"                  "EBP"                    "FCER2"                  "TRBC1"                  "DLGAP1"                 "BEX3"                   "MESD"                   "RFC3"                   "MS4A3"                  "INKA1"                  "PARPBP"                 "CLIC3"                  "ACKR3"                  "HLA-DMB"               
[1270] "MEG3"                   "CCR9"                   "SIRPG"                  "CDC25C"                 "PLK4"                   "SPATS2L"                "HCK"                    "NMU"                    "KIR2DL4"                "ACOXL"                  "GSTP1"                  "CLDN5"                  "IL32"                   "NRARP"                  "CHAC2"                  "HYOU1"                  "HACD1"                  "MPEG1"                  "RAMP1"                  "GMPS"                   "HCAR3"                  "SCN9A"                  "PLAUR"                  "CCDC168"                "TTN"                    "NELL2"                  "RP11-902B17.1"          "KLK1"                   "TRBV18"                 "UNG"                    "TNFAIP2"                "GATA3-AS1"              "CX3CR1"                 "SPACA3"                 "EPB42"                  "AK8"                    "LAYN"                   "LMO7-AS1"               "PGD"                    "KB-1980E6.3"            "CCDC28B"                "IFNGR2"                 "RBM47"                  "MAP3K8"                 "TNFSF10"                "IGHV1-69-2"             "PLAU"                  
[1317] "MYADM"                  "TPM1"                   "KRT81"                  "SYCE1"                  "SLC40A1"                "LINC02362"              "GTSF1"                  "IFNG-AS1"               "PLXDC2"                 "TCHH"                   "CALN1"                  "FANCA"                  "IGHV4-61"               "LINC01250"              "ANKRD28"                "PCSK1N"                 "SMIM1"                  "FANCD2"                 "IER5L"                  "AKR1B1"                 "OGN"                    "MAF"                    "MIXL1"                  "RP11-279O9.4"           "MTRNR2L1"               "NUPR2"                  "SAMD1"                  "MAFF"                   "GAL"                    "CENPK"                  "ANPEP"                  "TIFA"                   "CD68"                   "GFI1B"                  "XAF1"                   "SCAMP5"                 "ITGAX"                  "POLE2"                  "EPSTI1"                 "SLC1A4"                 "TRH"                    "RP13-192B19.2"          "CTD-2574D22.7"          "PRSS2"                  "CD83"                   "TPO"                    "KRT8"                  
[1364] "DHRS9"                  "DOK5"                   "CHRM3"                  "C15orf48"               "LTK"                    "MMS22L"                 "CHAF1B"                 "OR2B6"                  "TRBV3-1"                "DONSON"                 "PIMREG"                 "PAX5"                   "WNT4"                   "TNFSF13B"               "ITGA1"                  "SLC7A5"                 "RSAD2"                  "CEP128"                 "APOBEC3H"               "CENPV"                  "PF4V1"                  "LCN2"                   "H2AC17"                 "DAAM1"                  "PRF1"                   "H3C10"                  "SIT1"                   "PLEK"                   "CLMN"                   "MYCT1"                  "RPL39L"                 "DNTT"                   "CAP2"                   "CDH9"                   "COLEC11"                "TCN1"                   "HMCN1"                  "CCDC39-ENSG00000145075" "SLC26A7"                "P2RY6"                  "DIP2C-AS1"              "TEAD1"                  "COL4A5"                 "CR1L"                   "SPINK13"                "RP5-1119O21.2"          "RP11-462B18.2"         
[1411] "LGR4-AS1"               "LINC02763"              "ARLNC1"                 "AC005780.1"             "MIR924HG"               "RP3-483K16.4"           "JDP2"                   "SPRY1"                  "CEACAM4"                "SLC16A14"               "CXCL3"                  "BUB3"                   "LINC01447"              "FSCN1"                  "SLAMF7"                 "CCDC68"                 "MAP3K7CL"               "DST"                    "CDKN1C"                 "CD82"                   "ALOX12"                 "RGS10"                  "IGHV2-70D"              "IGLC6"                  "RBM20"                  "NFIB"                   "PIM1"                   "TLE1"                   "SLC11A1"                "VPS9D1-AS1"             "HMGN5"                  "CDK2"                   "TPM2"                   "SEPTIN11"               "CCL18"                  "RAD54L"                 "WDR62"                  "CBR1"                   "DSG2"                   "CLIC2"                  "PLBD1"                  "FHL1"                   "HMGA2"                  "TRBV6-1"                "CTA-992D9.11"           "TMEM167A"               "ZBP1"                  
[1458] "KIR3DX1"                "CYP2C18"                "SMIM43"                 "SAGE1"                  "GRID1"                  "AJAP1"                  "TDRG1"                  "TRBV11-1"               "CATSPERZ"               "NFE4"                   "AF015262.2"             "FRGCA"                  "IGKV2D-30"              "IGKV1D-33"              "IGKV2D-24"              "RP11-268P4.5"           "IGHVIII-38-1"           "BBOX1-AS1"              "RP11-113K21.4"          "SLC25A21-AS1"           "CTC-453G23.8"           "AC003002.6"             "RP11-378A12.1"          "RP11-434E6.4"           "ERICH3"                 "ADTRP"                  "SPAG4"                  "SYNM"                   "SAMD7"                  "TRBV12-4"               "CREM"                   "VIPR2"                  "PKIG"                   "MYO1D"                  "NFKBIZ"                 "ILK"                    "CSF3R"                  "CENPL"                  "HAVCR2"                 "KCNK6"                  "CYTOR"                  "C20orf27"               "CTAG2"                  "N4BP3"                  "BAALC"                  "CTSB"                   "FOLR3"                 
[1505] "LEF1"                   "MPO"                    "IL7R"                   "TRAV22"                 "ACP5"                   "EGFL6"                  "EGR2"                   "STON2"                  "RP11-492E3.2"           "CTDSPL"                 "FES"                    "BRIP1"                  "XRCC2"                  "SLC7A11"                "LINC01980"              "DNAJC12"                "EGLN3"                  "METTL7A"                "GAPT"                   "NXPH4"                  "MS4A7"                  "HACD3"                  "RECQL4"                 "HOMER3"                 "GADD45G"                "AIM2"                   "S100A10"                "SMCO4"                  "REG4"                   "LILRB2"                 "TMIGD2"                 "CREB5"                  "LILRA5"                 "PSMC3IP"                "CYP2S1"                 "NRP2"                   "UNQ6494"                "IL12A-AS1"              "CFP"                    "CPPED1"                 "HOXA11"                 "CYP4B1"                 "ACSM3"                  "C21orf58"               "NAMPT"                  "CDH1"                   "PFKP"                  
[1552] "MIR222HG"               "DOK3"                   "LRRIQ1"                 "RP3-416J7.4"            "ARX"                    "PREX2"                  "GPR12"                  "KLF15"                  "SLC16A9"                "IFIT1B"                 "TRBV6-6"                "MNX1-AS2"               "LINC01936"              "GXYLT1P5"               "AC141930.3"             "TLCD5"                  "ATP1B1"                 "CLEC12A"                "PALD1"                  "GCSAML"                 "HMBS"                   "PEBP1"                  "PTPN3"                  "LIM2"                   "TYROBP"                 "POLR3G"                 "TRBC2"                  "WNT11"                  "FXYD7"                  "WHRN"                   "HERPUD1"                "MT1F"                   "LY96"                   "KRT1"                   "MCEMP1"                 "LGALS3BP"               "AP3M2"                  "GALM"                   "FLT3"                   "TRAV12-2"               "HBQ1"                   "CD40LG"                 "ADA"                    "GADD45B"                "GIMAP4"                 "GBP2"                   "FCGRT"                 
[1599] "TRABD2A"                "PYCARD"                 "PLSCR1"                 "ESPL1"                  "ETV7"                   "SWAP70"                 "FGF18"                  "RP11-649E7.5"           "CD28"                   "CSF2RB"                 "POU2F2"                 "TRAV6"                  "TCERG1L"                "DHRS2"                  "LINC01857"              "KLRF1"                  "GNG3"                   "DEPP1"                  "MXRA8"                  "TRAV5"                  "RBP7"                   "HEMGN"                  "CLDN14"                 "TRBV6-4"                "TAFA5"                  "LINC00685"              "TRGV3"                  "LRRN3"                  "TP73"                   "DDAH2"                  "TUBB6"                  "ITM2A"                  "TP63"                   "CD22"                   "IRAK3"                  "TWIST1"                 "SAP30"                  "CTTN"                   "CMC1"                   "DENND2D"                "IRF7"                   "GMPR"                   "ABCA1"                  "SLC4A10"                "ST14"                   "CXCL2"                  "KYNU"                  
[1646] "DBN1"                   "ZNF571"                 "DDX3Y"                  "SLC16A1-AS1"            "SELENBP1"               "SQOR"                   "PACSIN1"                "PGRMC1"                 "RAB30"                  "PID1"                   "RCAN3"                  "BHLHE40"                "IL5"                    "MDH1B"                  "TRPC6"                  "CGN"                    "PRSS35"                 "SLC30A8"                "GAB4"                   "LINC01933"              "RP11-214K3.24"          "LA16c-306E5.2"          "UAP1"                   "SASH1"                  "IL6"                    "SFRP5"                  "NRIP1"                  "IL4I1"                  "RP11-373D23.3"          "BATF3"                  "NRP1"                   "LPP-AS1"                "FERMT1"                 "SH3BP4"                 "LINC00891"              "RNASEH2A"               "ETFA"                   "NET1"                   "SLC9A3R1"               "ADAM28"                 "IL1R2"                  "CYP1B1"                 "CORO1C"                 "RP11-106M7.1"           "TDRD15"                 "NLRP3"                  "GBP4"                  
[1693] "ST6GALNAC1"             "TMEM176A"               "WEE1"                   "LINC02865"              "PROK2"                  "AFF3"                   "RAB31"                  "OSM"                    "TNFRSF13C"              "GSPT1"                  "C16orf74"               "ZC3H12A"                "ZNF80"                  "LILRB3"                 "GINS1"                  "SCARB2"                 "OTOF"                   "HSBP1"                  "DAB2"                   "H2BC15"                 "TMCC2"                  "LINC02195"              "TRBV24-1"               "EDA"                    "BMP2"                   "SAT1"                   "FCGR3A"                 "TEDC2"                  "HRH2"                   "SLC25A37"               "MARCKSL1"               "FAM184A"                "FAM83D"                 "FGL2"                   "KRT7"                   "PRR15"                  "TRBV6-5"                "RP11-143I21.1"          "DPP4"                   "RP11-108L7.15"          "OAS2"                   "MAGEB2"                 "IGKV6-21"               "TEX49"                  "MRC2"                   "SELP"                   "PNP"                   
[1740] "IFI35"                  "TENT5C"                 "ANK1"                   "TTC39C-AS1"             "KCTD14"                 "CNTNAP2"                "NR4A3"                  "CASP3"                  "CR1"                    "LL22NC03-2H8.5"         "KRT18"                  "ARHGAP6"                "PIM3"                   "ENC1"                   "PLOD2"                  "NOP58"                  "IL13RA1"                "ZP1"                    "PYGL"                   "TRAV17"                 "DMXL2"                  "LIMS2"                  "CPM"                    "IGSF11"                 "METTL7B"                "FAM133A"                "C10orf67"               "RP11-474I16.8"          "AC005616.1"             "RP3-333B15.5"           "TRBV15"                 "SEPTIN3"                "F2-ENSG00000180210"     "RP11-104J23.1"          "RP11-214O1.3"           "FGFBP2"                 "C12orf45"               "PLK2"                   "POU3F1"                 "RRAD"                   "PDZK1IP1"               "KLF10"                  "WDFY4"                  "PECAM1"                 "C12orf75"               "DAPP1"                  "CITED2"                
[1787] "TMEM109"                "S100A4"                 "LINC02397"              "FXYD1"                  "HBEGF"                  "IGHV3-64"               "NEAT1"                  "MECOM"                  "RP11-693J15.6"          "ADAMTS2"                "ASGR1"                  "PARD6G"                 "IGFBP2"                 "PHLDA1"                 "H2AW"                   "MARCHF1"                "GM2A"                   "PAGE5"                  "JAML"                   "KLF4"                   "GTSCR1"                 "CD163"                  "RP6-91H8.3"             "HERC5"                  "TRAV25"                 "TRAV13-1"               "SYNGR1"                 "LYN"                    "CD93"                   "TFRC"                   "ANKRD9"                 "SYNJ2"                  "GLB1L3"                 "LY6E"                   "TRAV1-2"                "ARG2"                   "GALR3"                  "CREG1"                  "PTX3"                   "COL5A3"                 "CTSA"                   "SMIM3"                  "HNMT"                   "PPP1R9A"                "RASD2"                  "OPRM1"                  "PRSS33"                
[1834] "BICDL1"                 "HSPB1"                  "PHLDA2"                 "SORBS2"                 "DYSF"                   "RAB11A"                 "SCNN1B"                 "RPRM"                   "SPDYC"                  "BACE2"                  "C5AR1"                  "GCA"                    "FFAR3"                  "ARC"                    "TRBV5-6"                "RP11-342K6.4"           "TRBV25-1"               "LINC00926"              "TCEAL2"                 "TIMP1"                  "TRAV9-2"                "GATA3"                  "FGF14"                  "CLEC7A"                 "ETS2"                   "TRMT6"                  "EDN3"                   "USP18"                  "APOBEC3G"               "CD302"                  "GZMH"                   "STRBP"                  "LGALS9"                 "GAS2L1"                 "SFXN1"                  "CISH"                   "RP11-605B16.1"          "SLC7A7"                 "FKBP5"                  "TUBA4A"                 "ADGRE2"                 "XK"                     "F5-ENSG00000198734"     "IRS2"                   "PITPNM2-AS1"            "HSD11B1"                "TEX48"                 
[1881] "SH2D1B"                 "RBPMS2"                 "HLA-DPB1"               "IL2RB"                  "GLA"                    "GPR155"                 "TICRR"                  "DNAJA4"                 "PDE4B"                  "PROS1"                  "SMIM14"                 "C19orf38"               "CD5"                    "PTGS1"                  "ALCAM"                  "TRAT1"                  "CCDC80"                 "CREG2"                  "NIBAN3"                 "CYP4F3"                 "NEIL1"                  "SERTAD1"                "PAK1"                   "OSBPL10"                "EBF1"                   "CDA"                    "CCR2"                   "TMEM70"                 "GBP5"                   "TUBB2A"                 "MIR155HG"               "TLCD4"                  "LDLR"                   "LINC01934"              "SAMSN1"                 "MAP3K20"                "TOB1"                   "TRIM58"                 "PLK3"                   "FCGR1A"                 "C1orf162"               "MIR223HG"               "FRMD4B"                 "HDAC9"                  "LGALS14"                "AUTS2"                  "RAB27B"                
[1928] "GPR171"                 "SKIL"                   "MID1IP1"                "SAMD9L"                 "DEFA3"                  "SLC43A2"                "PMEPA1"                 "SOCS1"                  "IL3RA"                 

We can see how by recomputing the HVG we replace 33% of the HVG genes which capture the variability present in the T cell subset.

It looks like UMAP_1 is clearly separating 2 populations… let’s look at some QC metrics to see what might be going on by looking at the predicted.id

DimPlot(
    tse_sub,
    dims = c(1, 2),
    reduction = "pca",
    group.by = "predicted.id",
    label = TRUE)

It appears that we have some myeloid and B cells within our T cell subset…. Let’s check the PCA loadings

# Examine and visualize PCA results a few different ways
print(tse_sub[["pca"]], dims = 1:10, nfeatures = 15)
PC_ 1 
Positive:  TYMS, PCLAF, MKI67, BIRC5, ACTB, STMN1, RRM2, TK1, CDT1, ZWINT 
       MYBL2, PFN1, CDK1, PSME2, GAPDH 
Negative:  IL7R, CCR7, LTB, RCAN3, MAL, JUNB, RGCC, TRABD2A, ZNF331, TSHZ2 
       LEF1, CD69, SESN3, MYADM, RPS18 
PC_ 2 
Positive:  LTB, IL7R, RCAN3, RPS2, RPLP0, RPS18, CCR7, RPS5, MAL, LDHB 
       GPR183, LEF1, COTL1, CD28, TYMS 
Negative:  GZMB, PRF1, FGFBP2, FCGR3A, GNLY, GZMH, GZMA, TYROBP, SPON2, CLIC3 
       KLRF1, CCL4, FCER1G, TRDC, PLEK 
PC_ 3 
Positive:  TYROBP, MAP3K8, B3GNT7, MAFF, BHLHE40, IGFBP7, KLRF1, KLRB1, IER2, IER5 
       TAMALIN, FCER1G, SPON2, AREG, TYMS 
Negative:  TRAC, IL32, CD52, S100A8, VIM, S100A9, S100A11, LDHB, RPS4Y1, TMSB10 
       RPLP1, COTL1, LTB, RPS2, HBB 
PC_ 4 
Positive:  ZFP36, JUNB, CD69, DUSP2, JUN, NR4A2, DUSP1, IER2, NFKBIA, FOS 
       RGS1, CREM, IER5, PPP1R15A, ZNF331 
Negative:  CX3CR1, MT-ATP6, FCGR3A, KLRC2, PLEK, SYNGR1, TRDC, FGFBP2, TMPO, MTRNR2L12 
       RRM2, ASCL2, UBE2C, MYBL2, H1-3 
PC_ 5 
Positive:  IL32, TMSB4X, CD8A, S100A4, CD52, HLA-DRB1, GZMH, HLA-DPB1, CD8B, GZMA 
       TRGC2, S100A10, HLA-DPA1, DUSP2, TRBC2 
Negative:  MZB1, JCHAIN, TNFRSF17, IGKC, ITM2C, POU2AF1, DERL3, IGHG4, IGHG1, IGHG3 
       TNFRSF13B, COBLL1, PNOC, IGLC2, IGHA1 
PC_ 6 
Positive:  HLA-DRB1, MZB1, JCHAIN, HERPUD1, CD8A, HLA-DPB1, HLA-DPA1, TNFRSF17, GZMK, TRGC2 
       CD8B, HLA-DRA, HLA-DQB1, DUSP2, CD74 
Negative:  FCER1G, NRGN, PPBP, CAVIN2, IGFBP7, PF4, MAL, CCR7, TUBB1, TMIGD2 
       ACTN1, LTB, BEX3, RPS4Y1, GNG11 
PC_ 7 
Positive:  RPL37A, RPS18, RPLP1, RPS5, LTB, RPS2, IGFBP7, NPM1, MAL, C1orf162 
       PTMA, KLRF1, RPLP0, RCAN3, TMIGD2 
Negative:  NRGN, PPBP, CAVIN2, PF4, GNG11, TUBB1, S100A9, S100A8, SPARC, TREML1 
       GP9, CLU, ACRBP, RGS18, CD8A 
PC_ 8 
Positive:  S100A8, S100A9, LYZ, S100A12, VCAN, HBB, MNDA, FOS, GBP1, NFKBIA 
       STAT1, CXCL8, TYMP, TMEM176B, DUSP1 
Negative:  PF4, CAVIN2, PPBP, NRGN, TUBB1, GNG11, GP9, TREML1, ACRBP, CMTM5 
       SPARC, MYL9, PRKAR2B, CLU, TMEM40 
PC_ 9 
Positive:  IFI44L, XAF1, ISG15, MTRNR2L12, NEAT1, DUSP4, MX1, IFI16, CREM, EPSTI1 
       ZNF331, CAPZA1, RGS1, AUTS2, IRF4 
Negative:  MYADM, FTL, TAGLN2, RPS5, RPS2, DUSP1, MT-CO3, FOS, MT-CYB, LMNA 
       FOSB, MT-CO2, MIR23AHG, PPP1R15A, RGCC 
PC_ 10 
Positive:  MTRNR2L12, TMSB4X, S100A4, CRIP1, IFI44L, ITGB1, IL32, S100A10, LMNA, XAF1 
       S100A11, DLGAP5, LGALS1, CENPA, UBE2C 
Negative:  MT-CO3, AIF1, MT-CO2, GINS2, FABP5, CD8A, CD8B, E2F1, CDCA7L, MCM2 
       SOX4, HMGA1, MCM5, CDT1, MCM4 

Some interesting PCs are the following:

  • PC1 contains a lot of proliferation related genes
  • PC5 contains a lot of B cell related genes in the negative loadings
  • PC8 contains a lot of myeloid genes with high loadings
DimPlot(
    tse_sub,
    dims = c(1, 5),
    reduction = "pca",
    group.by = "predicted.id",
    label = TRUE) | DimPlot(
    tse_sub,
    dims = c(1, 8),
    reduction = "pca",
    group.by = "predicted.id",
    label = TRUE)

Let’s check some QC metrics to assess if we have potential doublets

VlnPlot(
    tse_sub,
    features = c("pANN", "nCount_RNA", "nFeature_RNA"),
    group.by = "RNA_snn_res.0.15",
    pt.size = 0
)

Clusters 6 & 7 have elevated pANN scores while cluster 3 has slightly elevated library complexity. Let’s check some cell type genes to determine what might be going on:

ct_genes <- c(
    # Proliferating genes
    "MKI67", "TOP2A", "STMN1",
    # B cell genes
    "CD79A", "CD79B", "MS4A1", 
    # Myeloid genes
    "S100A8", "S100A9", "CD14",
    "FCGR3A", "LYZ", "VCAN",
    # T cell genes
    "CD3D", "CD3E", "CD8B",
    "TRAC", "TRBC1", "TRDC",
    # NK genes
    "NCR1", "NCR2", "NCR3",
    "KLRF1", "KLRC2"
)

Seurat::DotPlot(
  object = tse_sub,
  features = ct_genes,
  group.by = "RNA_snn_res.0.15",
  col.min = 0,
  dot.min = 0) +
  ggplot2::scale_x_discrete(
    breaks = ct_genes) +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 60, hjust = 1)) +
  ggplot2::labs(x = "", y = "")

Clusters 6 and 7 seem to be proliferating cells. 6 appeard to have a mix of T and myeloid cells and 7 are B cells. Moreover, cluster 3 is simultaneously expressing Myeloid and T cell genes which could be an indication that these are doublets. Let’s double check these cells are in fact doublets:

# Scale gene expression of genes of interest
tse_sub <- ScaleData(tse_sub, features = ct_genes)
DoHeatmap(
    tse_sub,
    features = ct_genes,
    group.by = "RNA_snn_res.0.15")

Effectively, cluster 3 are doublets!

T cell clean up

The whole purpose of iterative subclustering is to capture the biological variability of a specific cell type. Clearly, our level 1 annotation of T cells included more than just T cells, for the prupose of this notebook let’s simulate a level-3 annotation where we removed clusters not of interest and focus on T cells!

# Keep only clusters 0, 1, 2, 4, 5
tse_sub <- tse_sub[, tse_sub$RNA_snn_res.0.15 %in% c(0, 1, 2, 4, 5)]

Reprocess it -

tse_sub <- tse_sub %>%
    FindVariableFeatures(
        method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.)) %>% 
    RunPCA(
        npcs = 50,
        verbose = FALSE
    )

# Save HVG for T cells only
hvg_t <- VariableFeatures(tse_sub)

# Look at the elbow plot
ElbowPlot(tse_sub, ndims = 50)

tse_sub <- tse_sub %>% 
    FindNeighbors(dims = 1:30, verbose = FALSE) %>%
    FindClusters(resolution = c(0.15, 0.2, 0.25, 0.3), verbose = FALSE) %>%
    RunUMAP(dims = 1:30, verbose = FALSE)

Let’s take a look at these clusters

DimPlot(
    tse_sub,
    reduction = 'umap',
    group.by = c(
        "RNA_snn_res.0.15", "RNA_snn_res.0.2",
        "RNA_snn_res.0.25",  "RNA_snn_res.0.3"))

dim_sub <- DimPlot(
    tse_sub,
    reduction = 'umap',
    group.by = c("RNA_snn_res.0.15"))

As showcased in the clustering algorithm a data-drive way to guide the decision making of the right clustering would be to run the Silhouette Analysis. For the prupose of this notebook we are going to move forward with 0.15.

Annotation playground

Let’s find the marker genes of this new clustering

Idents(tse_sub) <- tse_sub$RNA_snn_res.0.15
mgs <- FindAllMarkers(
    tse_sub,
    logfc.threshold = 0.25,
    only.pos = TRUE,
    min.pct = 0.25)

DT::datatable(mgs, filter = "top")

Do you think you can annotate these new clusters….

Cluster 1

Let’s look at genes that are differentially expressed

c1 <- c(
    "KLRF1", "FCER1G", "FCGR3A",
    "TRDC", "GZMB", "GNLY")
FeaturePlot(
    tse_sub,
    features = c(c1),
    ncol = 3) +
    dim_sub

VlnPlot(
    tse_sub,
    features = c(c1),
    group.by = "RNA_snn_res.0.15") +
    dim_sub

Cluster 4

c4 <- c(
    "KLRF1", "FCER1G", "FCGR3A",
    "TIGIT", "CX3CR1", "KLRC2")

FeaturePlot(
    tse_sub,
    features = c(c4),
    ncol = 3) +
    dim_sub

VlnPlot(
    tse_sub,
    features = c(c4),
    group.by = "RNA_snn_res.0.15",
    pt.size = 0) +
    dim_sub

Cluster 7

This cluster seems to be expressin Hemoglobin and MHC2 genes….

c7 <- c(
    "KLRF1", "FCER1G", "FCGR3A",
    "HBB", "HBA1", "HBA2",
    "HLA-DQA1", "HLA-DQA2")

FeaturePlot(
    tse_sub,
    features = c(c7),
    ncol = 3) +
    dim_sub

VlnPlot(
    tse_sub,
    features = c(c7),
    group.by = "RNA_snn_res.0.15",
    pt.size = 0) +
    dim_sub

Annotate

tse_sub@meta.data <- tse_sub@meta.data %>%
  dplyr::mutate(
    annotation_lvl2 = dplyr::case_when(
      RNA_snn_res.0.15 == 0 ~ "NK",
      RNA_snn_res.0.15 == 1 ~ "",
      RNA_snn_res.0.15 == 2 ~ "",
      RNA_snn_res.0.15 == 3 ~ "",
      RNA_snn_res.0.15 == 4 ~ "NK TIGIT+/CX3CR1+/KLRC2+",
      RNA_snn_res.0.15 == 5 ~ "",
      RNA_snn_res.0.15 == 6 ~ "",
      RNA_snn_res.0.15 == 7 ~ "Doublets",
      RNA_snn_res.0.15 == 8 ~ ""
      )
  )

DimPlot(tse_sub, group.by = "annotation_lvl2")

Comparing PCA

Lastly, lets compare the PC spaces between using HVG selected from the whole data or from the cleaned up T cell subset

So let’s take a closer look at what changes in the PCA latent spaces. Maybe the information we need is there.

latent_full <- tse_full@reductions$pca@cell.embeddings
loadings_full <- tse_full@reductions$pca@feature.loadings
var_full <- tse_full@reductions$pca@stdev
t_full <- tse_full@reductions$pca@misc$total.variance


latent_sub <- tse_sub@reductions$pca@cell.embeddings
loadings_sub <- tse_sub@reductions$pca@feature.loadings
var_sub <- tse_sub@reductions$pca@stdev
t_sub <- tse_sub@reductions$pca@misc$total.variance

var_full_df <- data.frame(
    PC = 1:length(var_full),
    Variance = var_full^2 / t_full,
    Set = "Full HVG")
var_sub_df <- data.frame(
    PC = 1:length(var_sub),
    Variance = var_sub^2 / t_sub,
    Set = "Clean HVG")

# Calculate cumulative variance explained
var_full_df$CumulativeVariance <- cumsum(var_full_df$Variance)
var_sub_df$CumulativeVariance <- cumsum(var_sub_df$Variance)

# Combine the datasets
combined_variance <- bind_rows(var_full_df, var_sub_df)

# Plotting the variance explained and cumulative variance
ggplot(combined_variance, aes(x = PC)) +
    geom_line(aes(y = CumulativeVariance, color = Set), size = 1.2) +
    geom_point(aes(y = CumulativeVariance, color = Set), size = 2) +
    scale_color_manual(values = c("blue", "red")) +
    scale_linetype_manual(values = c("solid", "dashed")) +
    theme_minimal() +
    labs(title = "Cumulative Variance Explained",
         x = "Principal Component",
         y = "Proportion of Variance Explained") +
    guides(color = guide_legend(title = "Parameter Set"), 
           linetype = guide_legend(title = "Variance Type"))

Both seem to be very similar, and it makes sense, since they are gene sets obtained by selecting the most variable genes from their respective spaces. However, the biological variance they capture is very different! We can look at that by comparing how the PC loadings of genes of interest change.

# Function to compare PC loadings between 2 embeddings, it basically compares the difference between the max value of each gene between both PC spaces
compare_pca_loadings <- function(loadings1, loadings2, top_n = 20, cell_type_markers = NULL,
                                  loadings1_name = deparse(substitute(loadings1)), 
                                  loadings2_name = deparse(substitute(loadings2))) {
    all_genes <- union(rownames(loadings1), rownames(loadings2))

    # Initialize matrices to store aligned loadings with genes set to 0 by default
    loadings1_aligned <- matrix(0, nrow = length(all_genes), ncol = ncol(loadings1), 
                                dimnames = list(all_genes, colnames(loadings1)))
    loadings2_aligned <- matrix(0, nrow = length(all_genes), ncol = ncol(loadings2), 
                                dimnames = list(all_genes, colnames(loadings2)))

    # Fill in the existing values from each matrix
    loadings1_aligned[rownames(loadings1), ] <- loadings1
    loadings2_aligned[rownames(loadings2), ] <- loadings2

    # Check for zero values immediately after filling in values
    loadings1_zero <- apply(loadings1_aligned, 1, function(x) all(x == 0))
    loadings2_zero <- apply(loadings2_aligned, 1, function(x) all(x == 0))

    outlines <- character(length(all_genes))
    outlines[loadings1_zero] <- "orange"
    outlines[loadings2_zero] <- "blue"
    # outlines[loadings1_zero & loadings2_zero] <- "white" # Both are zero

    max_loadings1 <- apply(abs(loadings1_aligned), 1, max)
    max_loadings2 <- apply(abs(loadings2_aligned), 1, max)

    # Calculate the differences between the maximum absolute loadings across all PCs
    loadings_difference <- max_loadings2 - max_loadings1

    # Determine the direction of the difference
    directionality <- ifelse(loadings_difference > 0, paste("Higher in", loadings2_name), paste("Higher in", loadings1_name))

    fillpal <- c('steelblue','salmon')
    names(fillpal) = c(paste("Higher in", loadings2_name), paste("Higher in", loadings1_name))

    # Create a data frame for plotting
    genes_differences_df <- data.frame(
        Feature = rownames(loadings1_aligned),  # Assuming both matrices have the same rownames
        Difference = loadings_difference,
        Direction = directionality,
        Outline = outlines[match(rownames(loadings1_aligned), all_genes)]
    )

    genes_differences_df <- genes_differences_df %>% mutate(
                MaxAbsLoadings1 = ifelse(Direction == names(fillpal)[1], -max_loadings1, max_loadings1),
                MaxAbsLoadings2 = ifelse(Direction == names(fillpal)[2], -max_loadings2, max_loadings2))

    if (!is.null(cell_type_markers)) {
        marker_genes <- unlist(cell_type_markers, use.names = FALSE)
        genes_differences_df <- genes_differences_df %>%
            dplyr::filter(Feature %in% marker_genes) %>%
            dplyr::mutate(CellType = NA)  # Assign NA initially

        for (cell_type in names(cell_type_markers)) {
            genes_differences_df$CellType[genes_differences_df$Feature %in% cell_type_markers[[cell_type]]] <- cell_type
        }
        
        
        plot <- ggplot(
            genes_differences_df,
            aes(x = reorder(Feature, Difference), y = Difference, fill = Direction)) +
            geom_col() +
            coord_flip() +
            theme_minimal() +
            scale_fill_manual(values = fillpal) +
            labs(title = "Gene Loadings Differences by Cell Type",
                 subtitle = paste("Comparing", loadings1_name, "and", loadings2_name),
                 x = "Gene",
                 y = "Difference in Loadings") +
            guides(fill = guide_legend(title = "Where Loadings are Higher")) +
            facet_wrap(~CellType, scales = "free_y", ncol = 1, drop = TRUE)
    } else {
        genes_differences_df <- genes_differences_df %>%
            dplyr::arrange(desc(Difference)) %>%
            dplyr::slice(1:top_n)

        plot <- ggplot(
            genes_differences_df,
            aes(x = reorder(Feature, Difference), y = Difference, fill = Direction)) +
            geom_col(show.legend = FALSE) +
            scale_color_manual(
                name = "Outline Color",
                values = c("green" = "green", "blue" = "blue", "purple" = "purple"),
                labels = c(paste("Zero in", loadings1_name),
                           paste("Zero in", loadings2_name),
                           "Zero in Both")) +
            coord_flip() +
            theme_minimal() +
            scale_fill_manual(values = fillpal) +
            labs(
                title = paste("Top", top_n, "Genes with Greatest Differences in Loadings"),
                subtitle = paste("Comparing", loadings1_name, "and", loadings2_name),
                x = "Gene",
                y = "Difference in Loadings") +
            guides(fill = guide_legend(title = "Where Loadings are Higher"))
    }

    return(plot)
}

Let’s compare which genes change the most between the full HVG and clean HVG

# Ensure both matrices have the same genes in the same order
common_genes <- intersect(rownames(loadings_full), rownames(loadings_sub))
loadings_full_aligned <- loadings_full[common_genes, , drop = FALSE]
loadings_sub_aligned <- loadings_sub[common_genes, , drop = FALSE]

# Determine high loadings
cutoff_l <- quantile(abs(as.matrix(loadings_full_aligned)), 0.95)
cutoff_h <- quantile(abs(as.matrix(loadings_sub_aligned)), 0.95)

high_loadings_full <- apply(abs(loadings_full_aligned), 1, max) > cutoff_l
high_loadings_sub <- apply(abs(loadings_sub_aligned), 1, max) > cutoff_h

# Identify features that are high in _h but not high in _l
target_features <- names(which(high_loadings_sub & !high_loadings_full))

# Prepare loadings of these target features for plotting
target_loadings <- loadings_sub_aligned[target_features, , drop = FALSE]

# Convert to long format for ggplot using pivot_longer
loadings_fullong <- as.data.frame(target_loadings) %>%
  tibble::rownames_to_column("Feature") %>%
  pivot_longer(
    cols = -Feature,
    names_to = "PC",
    values_to = "Loading"
  )

# Example usage:
compare_pca_loadings(loadings_full, loadings_sub, top_n = 40)

Many of these genes are trained immunity genes. At this point we could keep some rare positive-control celltype in mind, like ILC3s or something else we might expect to find in this dataset

# Define a list of cell type markers
cell_type_markers <- list(
    Tcell = c("TRAC", "TRBC1", "TRBC2", "CD3D", "CD3E"),
    CD8cyto = c("CD8A", "CD8B", "GZMA", "NKG7", "GZMK"),
    CD4 = c("CD4"),
    Trespone = c("CD40LG", "CD28"),
    Naive = c("CCR7", "LEF1", "TCF7"),
    Treg = c("FOXP3", "CTLA4", "IL2RA"),
    NK = c("NCR1", "NCR2", "NCR3", "KLRF1", "KLRC2"),
    Myeloid = c("S100A8", "S100A9", "CD14", "FCGR3A", "LYZ", "VCAN"),
    Bcell = c("CD79A", "CD79B", "MS4A1")
)

compare_pca_loadings(loadings_full, loadings_sub, cell_type_markers = cell_type_markers)

And if we compare the loadings for genes of interest

compare_pca_loadings(loadings_full, loadings_sub, cell_type_markers = cell_type_markers)

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggalluvial_0.12.5    tictoc_1.2.1         RColorBrewer_1.1-3   scales_1.3.0         DT_0.33              colorBlindness_0.1.9 devtools_2.4.5       usethis_2.2.2        lubridate_1.9.2      forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.4          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.0        tidyverse_2.0.0      Seurat_5.1.0         SeuratObject_5.0.2   sp_2.1-3            

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22       splines_4.3.1          later_1.3.2            polyclip_1.10-6        fastDummies_1.7.3      lifecycle_1.0.4        globals_0.16.2         processx_3.8.2         lattice_0.21-8         MASS_7.3-60            crosstalk_1.2.1        magrittr_2.0.3         limma_3.56.2           plotly_4.10.4          sass_0.4.8             rmarkdown_2.26         jquerylib_0.1.4        yaml_2.3.8             remotes_2.5.0          httpuv_1.6.14          sctransform_0.4.1      spam_2.10-0            sessioninfo_1.2.2      pkgbuild_1.4.2         spatstat.sparse_3.0-3  reticulate_1.35.0.9000 cowplot_1.1.3          pbapply_1.7-2          abind_1.4-5            pkgload_1.3.2.1        Rtsne_0.17             presto_1.0.0           ggrepel_0.9.5          irlba_2.3.5.1          listenv_0.9.1          spatstat.utils_3.0-4   goftest_1.2-3          RSpectra_0.16-1        spatstat.random_3.2-2  fitdistrplus_1.1-11    parallelly_1.37.0      leiden_0.4.3.1         codetools_0.2-19       tidyselect_1.2.0       farver_2.1.1           matrixStats_1.2.0      spatstat.explore_3.2-6 jsonlite_1.8.8         ellipsis_0.3.2         progressr_0.14.0       ggridges_0.5.6        
 [52] survival_3.5-7         tools_4.3.1            ica_1.0-3              Rcpp_1.0.12            glue_1.8.0             gridExtra_2.3          xfun_0.42              withr_3.0.0            fastmap_1.1.1          fansi_1.0.6            callr_3.7.3            digest_0.6.34          timechange_0.2.0       R6_2.5.1               mime_0.12              gridGraphics_0.5-1     colorspace_2.1-0       scattermore_1.2        tensor_1.5             spatstat.data_3.0-4    utf8_1.2.4             generics_0.1.3         data.table_1.15.4      prettyunits_1.1.1      httr_1.4.7             htmlwidgets_1.6.4      uwot_0.1.16            pkgconfig_2.0.3        gtable_0.3.4           lmtest_0.9-40          htmltools_0.5.7        profvis_0.3.8          dotCall64_1.1-1        png_0.1-8              knitr_1.45             rstudioapi_0.16.0      tzdb_0.4.0             reshape2_1.4.4         nlme_3.1-163           cachem_1.0.8           zoo_1.8-12             KernSmooth_2.23-22     parallel_4.3.1         miniUI_0.1.1.1         vipor_0.4.5            ggrastr_1.0.2          pillar_1.9.0           grid_4.3.1             vctrs_0.6.5            RANN_2.6.1             urlchecker_1.0.1      
[103] promises_1.2.1         xtable_1.8-4           cluster_2.1.4          beeswarm_0.4.0         evaluate_0.23          cli_3.6.3              compiler_4.3.1         rlang_1.1.3            crayon_1.5.2           future.apply_1.11.1    labeling_0.4.3         ps_1.7.5               plyr_1.8.9             fs_1.6.3               ggbeeswarm_0.7.2       stringi_1.8.3          viridisLite_0.4.2      deldir_2.0-2           munsell_0.5.0          lazyeval_0.2.2         spatstat.geom_3.2-8    Matrix_1.6-5           RcppHNSW_0.6.0         hms_1.1.3              patchwork_1.2.0        future_1.33.1          shiny_1.8.0            ROCR_1.0-11            igraph_2.0.2           memoise_2.0.1          bslib_0.6.1